
Conversation

@krishung5
Contributor

@krishung5 krishung5 commented Nov 7, 2025

Overview:

There are two separate but related fixes addressing different aspects of the memory leak/hang problem:

Issue 1: Race Condition in Finish Event Communication:

When vLLM finishes a request, it calls request_finished() on the leader. The leader and workers must coordinate to clean up GPU blocks, but there was a race condition where the leader would delete the slot before workers could process the finish event.

Time  | Leader Thread                    | Worker Thread
------|----------------------------------|----------------------------------
T1    | vLLM calls request_finished()    |
T2    | Mark slot as "Finishing"         |
      | (has 3 pending offload ops)      |
T3    | Remove slot from slot_manager    | 
      | Return to vLLM                   |
T4    |                                  | Receive finish event via metadata
T5    |                                  | Look for slot... NOT FOUND!
T6    |                                  | Skip cleanup (no slot)
T7    |                                  | Operations never complete
      |                                  | GPU blocks never freed
      |                                  | → MEMORY LEAK

Fix:
Keep Slots Alive in Finishing State - changes in leader.rs

  • Keeps the slot alive when it has pending operations (Finishing state)

Prevent New Operations After Finishing - changes in slot.rs

  • In offload_blocks, if the slot is already Finishing, don't create new operations.
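A minimal sketch of how these two changes fit together. The SlotState enum, Slot struct, and method names below are simplified stand-ins for illustration, not the actual leader.rs/slot.rs code:

```rust
// Sketch of the Issue 1 fix; these types are hypothetical stand-ins,
// not the real connector code.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SlotState {
    Active,
    Finishing, // finish requested, but operations are still pending
    Finished,  // all operations complete; safe to remove
}

struct Slot {
    state: SlotState,
    pending_ops: usize,
}

impl Slot {
    // Transition to Finishing if work is outstanding, Finished otherwise.
    fn mark_as_finished(&mut self) {
        self.state = if self.pending_ops > 0 {
            SlotState::Finishing
        } else {
            SlotState::Finished
        };
    }

    // Early-exit guard: refuse to create new offload operations once
    // the slot has started finishing.
    fn offload_blocks(&mut self) -> Result<(), String> {
        if matches!(self.state, SlotState::Finishing | SlotState::Finished) {
            return Ok(()); // no-op instead of racing with cleanup
        }
        self.pending_ops += 1;
        Ok(())
    }

    // Conditional removal: only drop the slot once it is fully Finished,
    // keeping Finishing slots alive so workers can drain pending ops.
    fn should_remove(&self) -> bool {
        self.state == SlotState::Finished
    }
}

fn main() {
    let mut slot = Slot { state: SlotState::Active, pending_ops: 3 };
    slot.mark_as_finished();
    assert_eq!(slot.state, SlotState::Finishing);
    assert!(!slot.should_remove()); // slot stays alive for workers
    slot.offload_blocks().unwrap();
    assert_eq!(slot.pending_ops, 3); // guard prevented a new operation
    println!("finishing-state guard ok");
}
```

The key invariant: a slot in Finishing refuses new work but stays registered until its pending operations drain, at which point it can transition to Finished and be removed.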

Issue 2: Failed offload operations didn't notify the scheduler

When an offload operation fails (e.g., "Not enough blocks available" due to CPU cache being full), the error is logged but no completion notification is sent to the scheduler. This leaves the scheduler's completed counter stuck, preventing slots from ever being marked as complete.

Time  | Leader Thread                    | Worker Thread
------|----------------------------------|----------------------------------
T1    | Offload operation created        |
T2    | Operation enqueued to workers    |
T3    |                                  | Worker starts processing
T4    |                                  | Operation fails:
      |                                  | "Not enough blocks available"
T5    |                                  | Error logged
      |                                  | NO completion notification!
T6    |                                  | Scheduler: completed=2, total=3
T7    |                                  | is_complete() returns false
T8    |                                  | Slot never becomes complete
T9    |                                  | GPU blocks never freed
      |                                  | → MEMORY LEAK

Fix:
Notify scheduler that this operation is "complete" - changes in slot.rs

  • Create a fake/immediate transfer request that completes instantly
  • This increments the workers' completed counter so they can progress
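A rough sketch of this pattern, with a plain mpsc channel standing in for the real BlockTransferRequest/ImmediateTransferCompletionHandle machinery (the struct and function names here are hypothetical):

```rust
use std::sync::mpsc;

// Sketch of the Issue 2 fix: a failed offload still emits a completion
// message so the scheduler's completed counter keeps advancing.

#[derive(Debug)]
struct Completion {
    request_id: u64,
    operation_id: u64,
}

struct Scheduler {
    completed: usize,
    total: usize,
}

impl Scheduler {
    fn is_complete(&self) -> bool {
        self.completed == self.total
    }
}

fn process_offload(
    fails: bool,
    request_id: u64,
    operation_id: u64,
    tx: &mpsc::Sender<Completion>,
) {
    if fails {
        // Error is logged, then a synthetic completion is emitted so the
        // scheduler's counter still advances and the slot can finish.
        eprintln!("offload failed: not enough blocks available");
        let _ = tx.send(Completion { request_id, operation_id });
        return;
    }
    // ... a real transfer would run here, then notify on success ...
    let _ = tx.send(Completion { request_id, operation_id });
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let mut sched = Scheduler { completed: 2, total: 3 };

    // The third operation fails, but still produces a completion.
    process_offload(true, 42, 3, &tx);
    let done = rx.recv().unwrap();
    assert_eq!(done.request_id, 42);
    assert_eq!(done.operation_id, 3);

    sched.completed += 1;
    assert!(sched.is_complete()); // slot can now be marked complete
    println!("synthetic completion ok");
}
```

Because the failure path now emits the same completion message as the success path, the completed counter always reaches total and is_complete() can eventually return true.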

Closes DIS-848

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved request completion handling with proper state tracking and synchronization during the finalization process
    • Enhanced error handling in block transfer operations to ensure completion notifications are properly propagated even when issues occur
    • Added guards to safely handle operations during finalization stages
    • Strengthened type safety for internal transfer mechanisms

@krishung5
Contributor Author

/ok to test 6097426

@krishung5 krishung5 force-pushed the krish/kvbm-mem-leak branch from e2e3b9f to 1b820f7 Compare November 7, 2025 15:30
@krishung5 krishung5 marked this pull request as ready for review November 7, 2025 15:58
@krishung5 krishung5 requested a review from a team as a code owner November 7, 2025 15:58
@coderabbitai
Contributor

coderabbitai bot commented Nov 7, 2025

Walkthrough

The changes enhance slot lifecycle management in the vLLM block manager connector. Updates include acquiring locks before marking slots finished, conditional slot removal based on state (Finished vs. Finishing), early-exit guards during finalization, explicit typing for transfer channels, and synthetic completion notifications on offload failures to ensure progress in error scenarios.

Changes

Cohort: Slot Request Completion and Lock Safety
File: lib/kvbm/src/block_manager/vllm/connector/leader.rs
request_finished now acquires the slot lock before marking the slot finished via mark_as_finished. Slot removal is now conditional: the slot is removed only when its state is Finished, and kept alive when Finishing so pending operations can drain. Clarifying comments were added on locking and state transitions.

Cohort: Slot State Guards and Transfer Improvements
File: lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
Added an early-exit guard in offload_blocks that no-ops when the slot is Finishing or Finished. Transfer channels now use explicit types (LocalOnboardRequest, LocalOffloadRequest). On offload failure, a synthetic completion notification is emitted with the captured request_id and operation_id to signal progress to the scheduler.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Lock ordering and deadlock potential: Verify that the new lock acquisition in request_finished does not introduce deadlock scenarios with existing locking patterns.
  • State transition semantics: Ensure the Finishing vs. Finished distinction correctly prevents operations on finalizing slots while preserving pending work.
  • Synthetic request construction: Validate that the fake LocalOffloadRequest and BlockTransferRequest are constructed correctly and that captured identifiers flow through notification pathways without loss.
  • Error path coverage: Confirm that the synthetic completion mechanism handles all offload failure scenarios and that scheduler counters advance correctly.

Poem

🐰 A slot's lifecycle, now locked up tight,
With states that finish, or hold the light,
No more hasty removals in the fray,
Synthetic signals save the day!
Transfer types bloom where channels align,
Our vLLM connector feels divine! ✨

Pre-merge checks

✅ Passed checks (3 passed)
Check name | Status | Explanation
Title check | ✅ Passed | The title concisely describes the main fix: addressing a GPU memory leak in KVBM, which is the primary objective of the PR.
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
Description check | ✅ Passed | The pull request description provides a comprehensive overview, detailed explanations of two separate issues with timing diagrams, specific fixes for each issue, and identifies key files for review. All required template sections are addressed with substantive content.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (1)

992-996: Early-exit guard prevents operations during finalization.

The guard correctly prevents new operations from being created when the slot is Finishing or Finished, addressing the race condition described in Issue 1. However, consider logging a debug message when this path is taken to aid observability and debugging, since callers may not be aware the operation was skipped.

Consider adding a debug log:

 // Check if slot is in Finishing state before creating operations
 // If we're finishing, don't create new operations
 if matches!(self.state, SlotState::Finishing | SlotState::Finished) {
+    tracing::debug!(
+        request_id = %self.request_id,
+        "skipping offload_blocks because slot is {:?}",
+        self.state
+    );
     return Ok(());
 }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f509493 and 1b820f7.

📒 Files selected for processing (2)
  • lib/kvbm/src/block_manager/vllm/connector/leader.rs (1 hunks)
  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (4 hunks)
🧰 Additional context used
🧠 Learnings (6)
📚 Learning: 2025-09-03T19:31:32.621Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 2840
File: lib/llm/src/kv_router/sequence.rs:86-88
Timestamp: 2025-09-03T19:31:32.621Z
Learning: PeaBrane chose to defer fixing the corner case where a single late-arriving request might never expire in the ActiveSequences expiry mechanism (lib/llm/src/kv_router/sequence.rs). They prefer to avoid adding a background loop for periodic cleanup at this time, accepting the technical debt to keep the current PR scope contained.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader.rs
📚 Learning: 2025-09-18T21:41:02.263Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 2989
File: lib/llm/src/block_manager/distributed/transfer.rs:59-60
Timestamp: 2025-09-18T21:41:02.263Z
Learning: The codebase has a robust two-layer transfer management system: TransferBatcher (offload/pending.rs:400-485) handles batching large transfers into MAX_TRANSFER_BATCH_SIZE chunks, and LocalTransferManager (offload/pending.rs:280-286) limits concurrency to MAX_CONCURRENT_TRANSFERS using FuturesUnordered.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-09-17T01:00:50.937Z
Learnt from: PeaBrane
Repo: ai-dynamo/dynamo PR: 3077
File: lib/llm/src/kv_router/subscriber.rs:334-336
Timestamp: 2025-09-17T01:00:50.937Z
Learning: PeaBrane identified that reordering tokio::select! arms in the indexer (moving dump_rx.recv() to be after event_rx.recv()) creates a natural barrier that ensures RouterEvents are always processed before dump requests, solving the ack-before-commit race condition. This leverages the existing biased directive and requires minimal code changes, aligning with their preference for contained solutions.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-06-02T19:37:27.666Z
Learnt from: oandreeva-nv
Repo: ai-dynamo/dynamo PR: 1195
File: lib/llm/tests/block_manager.rs:150-152
Timestamp: 2025-06-02T19:37:27.666Z
Learning: In Rust/Tokio applications, when background tasks use channels for communication, dropping the sender automatically signals task termination when the receiver gets `None`. The `start_batching_publisher` function in `lib/llm/tests/block_manager.rs` demonstrates this pattern: when the `KVBMDynamoRuntimeComponent` is dropped, its `batch_tx` sender is dropped, causing `rx.recv()` to return `None`, which triggers cleanup and task termination.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-05-29T06:20:12.901Z
Learnt from: ryanolson
Repo: ai-dynamo/dynamo PR: 1093
File: lib/llm/src/block_manager/block/registry.rs:98-122
Timestamp: 2025-05-29T06:20:12.901Z
Learning: In lib/llm/src/block_manager/block/registry.rs, the background task spawned for handling unregister notifications uses detached concurrency by design. The JoinHandle is intentionally not stored as this represents a reasonable architectural tradeoff for a long-running cleanup task.

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
📚 Learning: 2025-06-04T18:43:04.566Z
Learnt from: jthomson04
Repo: ai-dynamo/dynamo PR: 1363
File: lib/llm/src/block_manager/block/transfer.rs:206-216
Timestamp: 2025-06-04T18:43:04.566Z
Learning: For NIXL transfers in the KVBM system, the future returned by `nixl::write_blocks_to` is independent of the underlying transfer execution. The transfer begins immediately when `nixl::write_blocks_to` is called, and the returned future is only used for notification/completion tracking. Therefore, it's safe to drop the future when notification is not needed (`notify == false`).

Applied to files:

  • lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs
🧬 Code graph analysis (1)
lib/kvbm/src/block_manager/vllm/connector/leader.rs (1)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (4)
  • request_id (94-94)
  • request_id (431-433)
  • state (96-96)
  • state (435-437)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: sglang (amd64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: sglang (arm64)
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (5)
lib/kvbm/src/block_manager/vllm/connector/leader/slot.rs (3)

1182-1183: LGTM: Explicit channel typing improves clarity.

The explicit typing of transfer channels enhances code readability and type safety without altering functionality.


1222-1223: LGTM: Captures IDs for synthetic completion notification.

Correctly captures request_id and operation_id before the request is moved, enabling the synthetic completion pathway on failure.


1234-1258: Synthetic completion notification with empty blocks is safe and correctly addresses the failure scenario.

The implementation properly handles offload processing failures:

  • Empty BlockTransferRequest with RequestType::Immediate completes instantly through the completion channel
  • ImmediateTransferCompletionHandle sends the completion notification regardless of block count, ensuring the scheduler's completed counter increments even on failure
  • The Drop impl (protocol.rs:276-288) enforces that mark_complete() must be called, preventing silent completion tracking failures
  • Silent error handling (lines 1255-1257) is acceptable since the error is already logged above

The solution correctly ensures scheduler progress even when offload processing fails.

lib/kvbm/src/block_manager/vllm/connector/leader.rs (2)

510-518: LGTM: Lock acquisition prevents race condition.

Acquiring the lock before marking the slot as finished (line 518) is critical to preventing the race condition described in Issue 1. This ensures that the state transition to Finishing happens atomically with respect to any operation creation attempts in offload_blocks(). Combined with the early-exit guard added to slot.rs (lines 992-996), this prevents workers from missing the finish event because the slot was removed before they could process pending operations.


527-538: Code is correct - no changes needed.

The implementation properly handles the Finishing state and prevents memory leaks:

  1. Worker's is_complete() checks that all operations are complete by comparing completed count to total operations
  2. The remove_slot() method includes a defensive assertion requiring slot.is_complete() to be true before removal
  3. Leader extracts pending operations via take_pending_operations() and adds them to the scheduler so they complete before is_complete() returns true
  4. Worker panics if a slot unexpectedly disappears while in the maybe_finished set, catching protocol violations

The conditional removal logic correctly differentiates between Finishing (operations pending) and Finished (all resources released), ensuring blocks remain available while workers process pending operations.

Contributor

@ziqifan617 ziqifan617 left a comment


awesome work!

Comment on lines 1235 to 1237
// Notify scheduler that this operation is "complete" (even though it failed)
// Create a fake/immediate transfer request that completes instantly
// This increments the workers' completed counter so they can progress
Contributor


Suggested change
// Notify scheduler that this operation is "complete" (even though it failed)
// Create a fake/immediate transfer request that completes instantly
// This increments the workers' completed counter so they can progress
// Create a fake/immediate transfer request that completes instantly.
// Otherwise, the worker side might get stuck and cause a memory leak.

Comment on lines +516 to +518
// Mark the slot as finished (sets state to Finishing if there are operations,
// or Finished if all operations are complete)
slot.mark_as_finished(self.iteration_counter)?;
Contributor


Hmm, looking through the code, I'm wondering where the slot would ever be marked as SlotState::Finished. I only see places that set it to SlotState::Finishing? Will follow up offline.

Contributor Author


Thanks for the catch. Resolved offline. Fixed in the latest commit
